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Abstract 

Count data, for example the number of observed cases of a disease in a city, often arise 
in the fields of healthcare analytics and epidemiology. In this paper, we consider performing 
regression on multivariate data in which our outcome is a count. Specifically, we derive log- 
likelihood functions for finite mixtures of regression models involving counts that come from a 
Poisson distribution, as well as a negative binomial distribution when the counts are significantly 
overdispersed. Within our proposed modeling framework, we carry out optimal component 
selection using the information criteria scores AIC, BIC, CAIC, and ICOMP. We demonstrate 
applications of our approach on simulated data, as well as on a real data set of HIV cases in 
Tennessee counties from the year 2010. Finally, using a genetic algorithm within our framework, 
we perform variable subset selection to determine the covariates that are most responsible for 
categorizing Tennessee counties. This leads to some interesting insights into the traits of counties 
that have high HIV counts. 


1 Introduction 


Count data often arise in healthcare and epidemiology data sets. For example, the number of 
outcomes observed (i.e. cases of a disease) in a specific group of people (i.e. citizens residing in 
Knox county). We tend to treat count data as realizations from a Poisson distribution, so that the 
probability of observing y events given an expected number of events, is precisely 


P[y = y,y\ 


^^exp(-^) 

y! 


( 1 ) 


In regression analysis, Poissonian data are most often related to a linear combination of inde¬ 
pendent variables (i.e., covariates, exposures) via a log-link function. Hence, the Poisson regression 
model is defined as 

p 

ln(E[y]) = /3o + X/5iV> (2) 

f=i 

where (dj, j = are the measured effects of k indpendent exposures, X. For a binary 

exposure, Xj, the amount exp(/3j) is the increase (or decrease) in outcome incidence for subjects 
in which the exposure was observed, relative to subjects in which the exposure was not observed 
|8]. The amount exp(/3o) is the expected number of counts in baseline subjects. Researchers are 
responsible for determining the criteria defining a baseline measurement. 

A major assumption in Poisson regression modeling is that the mean and variance of the 
observed counts are equivalent. When this assumption is violated, as is often true of real data. 
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we say the data are overdispersed. A negative binomial distribution may more appropriate than a 
Poisson for describing overdispersed count data, due to its excessive tail behavior [8]. 

The difference between negative binomial regression (also referred to as NB-2 models) and Pois¬ 
son regression comes from our treatment of the observed count data. We still make the assumption 
that these counts come from a Poisson distribution, but we make the added assumption that the 
mean number of events, /r, comes from a 2-parameter Gamma distribution. 

1.1 Mixture models 

Finite mixture (FM) modeling has emerged within the past 20 years as a popular means for 
handling unsupervised classification tasks. The underlying principle behind mixture modeling is 
that we treat our observed data as having been sampled from a convex sum of distributions mm- 
This concept is particularly useful for explainining overdispersion in count data as being due to an 
underlying heterogeneous population. 

FM modeling has appeared in tandem with Poisson regression models. The extent of this 
type of research has mostly been limited to zero-inflated, zero-truncated, and hurdle-type models 
12]. These models are designed to handle zero-counts, and are largely inappropriate for explaining 
heterogeneities due to covariates. Papastamoulis et al. (2014) described a methodology that can 
address covariate heterogeneity using FMs of Poisson regression models in the context of high- 
throughput sequencing data m- Their EM algorithms for estimating G > 10 mixtures are freely 
available online from GRAN. 

In addition. Park and Lord (2009) have developed FMs of Poisson and negative binomial regres¬ 
sion models for explaining hetergeneities in vehicle crash data [12]. Of the modeling frameworks 
that have been suggested in the literature, theirs is most similar to ours in that they proposed using 
the information criteria AIG, BIG, and DIG to choose the optimal number of mixing components. 
However, these criteria did not give conclusive results, and so they were forced to subjectively 
choose a mixture of 2 NB-2 models. In the proceeding sections, we will discuss potential rea¬ 
sons for these inconclusive results. For more information on finite mixtures of negative binomial 
regression models, see also Zou et al. (2013) [T7| . 

1.2 Model selection 

Model selection exists as an alternative approach to classical hypothesis-drive statistics which 
require distributional assumptions and an arbitrary selection of a confidence level to evaluate 
p-values. Akaike was the first to propose a criterion for conducting model selection when he 
introduced his celebrated Akaike information criterion (AIG) in 1973. The formula, which involves 
a tradeoff between a candidate model’s lack of fit given by the maximized log-likelihood function, 
and a penalty term for the number of parameters, n^, is shown below: 

AIG =-21ogL(0|A)+ 2nfc. (3) 

When we compare 2 or more models that have been fit to the same set of data, we prefer to choose 
the model that minimizes the AIG score. 

As it turns out, the original penalty term that Akaike proposed, 2nk, is not enough to prevent 
model overfitting [4j. Numerous other information criteria (IG) have been proposed, including 
but not limited to Schwarz’s Bayesian Criterion (SBC or BIG) [H], the deviance information 
criterion (DIG) (which we will not consider here), the consistent form of AIG (CAIC) |3], and the 
information-theoretic measure of complexity (ICOMP) [5] which involves evaluating the inverse 
Fisher’s information matrix (IFIM). The reader is directed to the respective citations for derivations 
of the formulae. We will point out that each of the criteria we consider in this paper all share the 
lack-of-fit term in common - what makes them different is how they penalize overparameterization. 
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2 Methods 


2.1 Log-likelihood functions 

Our ultimate goal when we carry out model selection in this framework is to determine the optimal 
number of components in a FM model, given a hxed maximum number of components (see [5] for 
heuristics on choosing this number). This optimal number is reflective of the number of subpop¬ 
ulations composing the entire population from which we have sampled our data. Before we can 
perform model selection, we must be able to generate the various criteria discussed previously. To 
do so requires that we compute the log-likelihood function. 

Recall equation Q when dealing with Poissonian data whose mean and variance are similar. It 
follows that E[y|X,/3] = exp(/3^X). Since our outcome is sampled from a convex sum of Poisson 
distributions we have the following probability for observing n-dimensional count data, V: 


i=lg=l 

In equation Q, G is the total number of components in the FM model; is the weight assigned 
to component g; and, /3g refers to the effects in the g-th component, as this will vary within 
components. It follows directly from equation Q that the log-likelihood is 

log(T(^,vr;y,X)) = ^ I ^ [log(7rg)-hexp(-/3jxi) - log(yi!)] >. (5) 

i=l [g=l J 

When data are signihcantly overdispersed, recall that we assume our n counts, V, come from a 
Poisson distribution whose mean parameter, A, is Gamma-distributed. This additional constraint 
gives us the following probability density for Y: 


n G 

P[y;X,/3,7r,a] = nE 

i=lg=l 


VT, 


F {yi + 1/ag) I otg^l 
' r(l/ag)?/i! 


Vi 


1 -F OgPjXi 


1 -F agjijxi 


( 6 ) 


The interpretation of vr and /3 follows as before. The a* are overdispersion parameters, satisfying 
var(y) = (1 -b aff"X)ff"X (note: mean(y) = ff"X). The quadratic dependence of the variance 
on the mean is why we refer to these as NB-2 models. 

Finally, following directly from equation ([^, the log-likelihood function for the negative bino¬ 
mial regression model is shown below [3]: 


log(L(/3,7r,a;y,X)) 


E 


G 

E 

9=1 


log F y + 



log F 



+ yi (log(ag/3jxi) 


- log(l -b ag/3jxi)) - log(l 


OglSjxi)-logiyil) |. 


( 7 ) 

Note that when a = 0, the data are not overdispersed and so the assumption that mean(y) = 
var(y) is satisfied. Using our methodology, if the overdispersion parameter is below a certain 
threshold, we elect to perform Poisson regression, otherwise we perform NB-2 regression. This is 
especially important for computing the log-likelihood functions once we carry out scoring. 


2.2 Initialization 

The choice of an appropriate initialization scheme was a source of great concern for Papastamoulis 
et al. in their algorithm m- They used a “Small-EM” strategy from Biernacki et al. (2003), 
which was necessary for dealing with a large number of component mixtures (G >> 5) [2]. 

We are not interested in exploring the possibility of more than G = 5 components unless 
there is some justifiable epidemiological reason for doing so. Because of this, we opted for using 
a Poisson mixture model to perform unsupervised classification of our observed counts (see |in)l. 
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As a result, our methodology involves first sorting the count data into G < 5 groups, and then 
assigning observations in each group to an initial Poisson or NB-2 regression model. 

3 Results 

All numerical simulations were carried out in MATLAB 2015a. We used ArcGIS 10.2 for generating 
the map of Tennessee counties. 

3.1 Simulated data 

We begin using our proposed model framework to choose the optimal number of components in 
a simple set of simulated data. Each observation in the data is an ordered pair, {x,y), where 
X ~ N{yi, 1), and y ~ Poisson(Aj), i = 1,... ,4. The parameters yi and Aj for all i are randomly 
generated integers between 1 and 50. In this way, each observation comes from 1 of 4 potential 
groups. 

We tried fitting FMs of 1 to 5 Poisson regression models to these data; their scores are shown 
in Table We see that all 3 scores select G = 2 as the optimal number of components. The 
regression models corresponding to the solution when G = 2 are shown in the right pane of Figure 
The regression models each seem to do well modeling these data. 


G 

AIC 

CAIC 

ICOMP 

1 

2931.14 

2939.74 

2944.77 

2 

1392.98 

1414.47 

1416.78 

3 

1400.08 

1434.46 

1434.19 

4 

1428.51 

1475.79 

1472.49 

5 

1455.42 

1515.59 

- 


Table 1: IC scores for mixtures of G = 1,..., 5 Poisson regression models applied to the simulated 
data shown in FigureAll three of the scores are minimized for G = 2. Our regression framework 
failed to produce a stable IFIM when G = 5, so no ICOMP score was computed. 


3.2 Tennessee HIV counts 

We now use real data taken from a 2014 county-level study on persons living with HIV in the 
U.S. South [9]. Readers are encouraged to refer to [9] for a summary of the entire data set. For 
demonstrating our methodology, we are interested only in the TN counties that were included in 
the study. These data are summarized in Table For reference, the abbreviations we use when 
referring to individual covariates are summarized below in Table 

The count data are largely overdispersed (mean(HIV) = 166.45, var(HIV) > 500,000), so we 
proceed by using mixtures of NB-2 models. In Figurewe see that the model selected 4 regression 
models, which is indicative of 4 separate homogeneous subpopulations within Tennessee counties. 

Given that there are 4 subpopulations, we used a genetic algorithm (GA) to select the subset 
of variables from the original 8 that contributes most to sorting the data. GA have been used 
numerous times for carrying out variable subset selection, although never for Poisson or NB-2 
regression (see PlZllle] for GA applied to other regression models). The results for performing 
GA subsetting in our model are summarized in Table We see that variables 4 and 7 appear in 
each of the subsets most frequently chosen, and the model fitted with only variables 1, 3, 4, 5, and 
7 is the best at minimizing the IC scores. 

In the interest of parsimony. Table shows coefficients from each of the 4 components in a FM 
model fitted with just the urban indicator and the non-Hispanic white proportion. We see that 
the proportion of whites in a county is significantly protective against HIV count in 3 out of 4 
components, while the urban indicator is a significant risk for HIV count. 
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No. 

ID 

Description 

Data type 

- 

HIV 

Number of persons living with HIV 

Gontinuous (count) 

1 

SGH 

Proportion of persons with less than a HS education 

Gontinuous 

2 

POV 

Proportion of persons living below the poverty line 

Gontinuous 

3 

ING 

Natural logarithm of median income 

Gontinuous 

4 

URB 

Urbanicity indicator (population <50,000/>50,000) 

Gategorical (2) 

5 

UMP 

Unemployment rate 

Gontinuous 

6 

NHB 

Proportion of Non-Hispanic Black persons 

Gontinuous 

7 

NHW 

Proportion of Non-Hispanic White persons 

Gontinuous 

8 

HSP 

Proportion of Hispanic persons 

Gontinuous 


Table 2: Description of the variables in the Tennessee HIV data. The outcome is HIV. Categorical 
variables have the number of levels listed in parentheses. 


Subset 

Rel. freq. 

AIG 

GAIG 

SBG 

1, 3, 4, 5, 7 

17.00% 

708.76 

776.89 

756.89 

1, 2, 3, 4, 5, 7 

10.00% 

703.51 

781.86 

758.86 

4, 7 

10.50% 

849.86 

887.33 

876.33 


Table 3: Relative frequencies and IC scores for variable subsets chosen using the GA. 


In components 1 and 4, urban indicator was not included in the regression analysis because all 
of the counties belonging to those components had urban indicators of 0 or 1, respectively. Based on 
Tablej^ this makes sense. We see that component 1 contains counties with HIV counts less than 15, 
while component 4 contains counties with counts greater than 100. The proportion of individuals 
with a high school diploma or lower decreases in each component, as do the non-Hispanic white 
proportions. Interestingly, the poverty rate and unemployment rates also decrease. At the same 
time, the urban indicator, and the non-Hispanic black and Hispanic proportions increase with each 
component. 

4 Discussion 

Our results indicate that the proportion of white persons in a county seems to be a significant 
predictor of HIV count in that county. This is slightly inconsistent with the conclusions of Gray, 
et al. [9] , who found that the proportion of black persons was the most significant contributor to 
HIV rate. That being said, our GA the non-Hispanic black variable was selected numerous times 
as a significant variable, however it was not one the most significant (results not shown). 

That urban indicator was selected as a significant variable comes as no surprise. This result 
tells us that as the population increases, we should also expect to see an increase in HIV count. 
For numerous different reasons, this result makes perfect sense. Moreover, in Figure we see 
that the counties with the highest HIV counts correspond to the some of the largest cities in 
Tennessee, including Knox (Knoxville), Davidson (Nashville), Hamilton (Ghattanooga), and Shelby 
(Memphis). 

It was also interesting to find that poverty rates and education levels tended to be negatively 
associated with HIV counts. Our best explanation is that, at least in Tennessee, population size 
is a confounder for education level and poverty rate. That is to say that as the population of a 
county increases, we also expect to see access to jobs and education increasing. It would be worth 
investigating these results further to determine what potential public health impacts exist. 
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5 Conclusion 


We have proposed a novel framework for analyzing data sets in which the response variable is 
a count. Our approach allows us to systematically identify homogeneous subpopulations arising 
in the dataset, based on the covariates and their contributions to the count data in a Poisson 
regression model. We are also able to select the covariates that contribute most to determining the 
aforementioned subpopulations. Using this approach on a set of HIV count data in Tennessee has 
led to reasonable and quantifiable results. This would suggest that we could apply this approach 
to other sets of data involving counts of other infectious diseases, such as influenza, dengue fever, 
or even the Ebola virus disease. 
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6 Figures 



Figure 1: On the left, we see the data in its original form, with each observations coming from 1 of 
4 distinct groups. On the right, we see the same data, with the mixture of two Poisson regression 
models that was chosen by ICOMP overlaid. Different colors and symbols indicate observations 
belonging to different mixtures. 



Figure 2: Information criteria scores for hnite mixtures of G = 1,...,5 NB-2 models using the 
Tennessee county-level HIV data. We see that all 3 scores are minimized for G = 3. 



95 - 6408 


Figure 3: HIV cases by county in Tennessee. 
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7 Tables 


Name 

Mean 

St. Dev. 

Min 

Max 

HIV 

166.45 

770.43 

2.00 

6408.00 

SCH 

0.21 

0.05 

0.05 

0.30 

POV 

0.17 

0.04 

0.05 

0.30 

INC 

10.55 

0.20 

10.01 

11.42 

URB 

0.31 

0.46 

0.00 

1.00 

UMP 

0.11 

0.03 

0.05 

0.18 

NHB 

0.07 

0.10 

0.00 

0.52 

NHW 

0.88 

0.12 

0.39 

0.99 

HSP 

0.03 

0.02 

0.00 

0.11 


Table 4: Summary statistics of Tennessee county-level HIV data taken from |9j. 


Variable 


S. E. : 

2.5% 

97.5% 

Component 1 

URB 

- 

- 

- 

- 

NHW 

-2.14 

1.30 

-4.69 

0.41 

Int 

4.06 

1.20 

1.70 

6.41 


Component 2 



URB 

0.40 

0.091 

0.22 

0.58 

NHW 

-1.85 

0.51 

-2.85 

-0.85 

Int 

4.84 

0.45 

3.96 

5.72 


Component 3 



URB 

0.35 

0.11 

0.12 

0.57 

NHW 

-1.05 

0.39 

-1.81 

-0.29 

Int 

4.94 

0.28 

4.38 

5.50 


Component 4 


URB 

- 

- 

- 

- 

NHW 

-7.12 

1.40 

-9.86 

-4.37 

Int 

11.57 

1.08 

9.46 

13.68 


Table 5: Coefficients from the Poisson regression models in each component, using the variables 
selected by GA. 
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Variable 

Mean 

St. dev. 

Min. 

Max. 

Component 1 

HIV 

8.03 

4.00 

2.00 

15.00 

SCH 

0.24 

0.04 

0.16 

0.30 

POV 

0.20 

0.04 

0.10 

0.30 

INC 

10.43 

0.15 

10.01 

10.71 

URB 

0.00 

0.00 

0.00 

0.00 

UMP 

0.12 

0.03 

0.07 

0.17 

NHB 

0.03 

0.05 

0.00 

0.26 

NHW 

0.93 

0.06 

0.68 

0.99 

HSP 

0.02 

0.02 

0.00 

0.09 


Component 1 

) 


HIV 

27.55 

8.89 

15.00 

45.00 

SCH 

0.21 

0.03 

0.15 

0.30 

POV 

0.17 

0.03 

0.12 

0.23 

INC 

10.54 

0.10 

10.30 

10.81 

URB 

0.27 

0.45 

0.00 

1.00 

UMP 

0.11 

0.02 

0.08 

0.18 

NHB 

0.05 

0.07 

0.00 

0.41 

NHW 

0.89 

0.08 

0.56 

0.97 

HSP 

0.03 

0.03 

0.01 

0.11 


Component 3 


HIV 

72.92 

16.84 

45.00 

94.00 

SCH 

0.17 

0.04 

0.12 

0.26 

POV 

0.14 

0.05 

0.09 

0.23 

INC 

10.70 

0.21 

10.40 

11.02 

URB 

0.54 

0.52 

0.00 

1.00 

UMP 

0.10 

0.02 

0.07 

0.15 

NHB 

0.14 

0.15 

0.01 

0.49 

NHW 

0.81 

0.15 

0.45 

0.94 

HSP 

0.03 

0.01 

0.02 

0.06 


Component 4 


HIV 

1051.31 

1913.88 

101.00 

6408.00 

SCH 

0.13 

0.03 

0.05 

0.18 

POV 

0.13 

0.03 

0.05 

0.17 

INC 

10.80 

0.21 

10.60 

11.42 

URB 

1.00 

0.00 

1.00 

1.00 

UMP 

0.09 

0.02 

0.05 

0.12 

NHB 

0.16 

0.15 

0.02 

0.52 

NHW 

0.75 

0.16 

0.39 

0.94 

HSP 

0.05 

0.02 

0.01 

0.10 


Table 6: Summary statistics for counties belonging to each component. 
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